18 research outputs found

SUFFIX TREE, MINWISE HASHING AND STREAMING ALGORITHMS FOR BIG DATA ANALYSIS IN BIOINFORMATICS

Author: Behera Sairam
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 04/12/2020

In this dissertation, we worked on several algorithmic problems in bioinformatics using mainly three approaches: (a) a streaming model, (b) suffix-tree based indexing, and (c) minwise hashing (minhash) and locality-sensitive hashing (LSH). The streaming models are useful for large data problems where a good approximation needs to be achieved with limited space usage. We developed an approximation algorithm (KmerEstimate) using the streaming approach to obtain a better estimation of the frequency of k-mer counts. A k-mer, a subsequence of length k, plays an important role in many bioinformatics analyses such as genome distance estimation. We also developed new methods that use a suffix tree, a trie data structure, for alignment-free, non-pairwise algorithms for a conserved non-coding sequence (CNS) identification problem. We provided two different algorithms: STAG-CNS to identify exact-matched CNSs and DiCE to identify CNSs with mismatches. Using our algorithms, CNSs among various grass species were identified. A different approach was employed for identification of longer CNSs (≥ 100 bp, mostly found in animals). In our new method (MinCNE), the minhash approach was used to estimate the Jaccard similarity. Also using LSH, k-mers extracted from genomic sequences were clustered and CNSs were identified. Another new algorithm (MinIsoClust) that also uses minhash and LSH techniques was developed for an isoform clustering problem. Isoforms are generated from the same gene but by alternative splicing. As the isoform sequences share some exons but in different combinations, regular sequence clustering methods do not work well. Our algorithm generates clusters for isoform sequences based on their shared minhash signatures. Finally, we discuss de novo transcriptome assembly algorithms and how to improve the assembly accuracy using ensemble approaches. First, we did a comprehensive performance analysis on different transcriptome assemblers using simulated benchmark datasets. Then, we developed a new ensemble approach (Minsemble) for the de novo transcriptome assembly problem that integrates isoform clustering using the minhash technique to identify potentially correct transcripts from various de novo transcriptome assemblers. Minsemble identified more correctly assembled transcripts as well as genes compared to other de novo and ensemble methods. Adviser: Jitender S. Deogun
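
The abstract above mentions using minhash to estimate the Jaccard similarity between sets of k-mers. The following is a minimal Python sketch of that general idea, not the dissertation's implementation: the function names, the choice of BLAKE2b as the hash family, and the default of 128 hash functions are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(item, seed):
    # Hypothetical hash family: one 64-bit BLAKE2b digest per (seed, item).
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Signature = minimum hash value of the set under each hash function.

    items must be non-empty, otherwise min() raises ValueError.
    """
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

For two sequences s and t, comparing estimate_jaccard(minhash_signature(kmers(s, k)), minhash_signature(kmers(t, k))) against jaccard(kmers(s, k), kmers(t, k)) shows how closely the fixed-size signatures track the exact value; more hash functions give a tighter estimate at the cost of longer signatures.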

A consensus‑based ensemble approach to improve transcriptome assembly

Author: Behera Sairam
Cahoon Edgar B.
Deogun Jitender S.
Kapil Kushagra
Li Xiangjun
Moriyama Etsuko N.
Shanklin John
Voshall Adam
Yu Xiao‑Hong
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 01/01/2021

Background: Systems-level analyses, such as differential gene expression analysis, co-expression analysis, and metabolic pathway reconstruction, depend on the accuracy of the transcriptome. Multiple tools exist to perform transcriptome assembly from RNAseq data. However, assembling high-quality transcriptomes is still not a trivial problem. This is especially the case for non-model organisms where adequate reference genomes are often not available. Different methods produce different transcriptome models and there is no easy way to determine which are more accurate. Furthermore, alternative-splicing events exacerbate such difficult assembly problems. While benchmarking transcriptome assemblies is critical, this is also not trivial due to the general lack of true reference transcriptomes. Results: In this study, we first provide a pipeline to generate a simulated benchmark transcriptome and corresponding RNAseq data. Using the simulated benchmarking datasets, we compared the performance of various transcriptome assembly approaches including both de novo and genome-guided methods. The results showed that the assembly performance deteriorates significantly when alternative transcripts (isoforms) exist or, for genome-guided methods, when the reference is not available from the same genome. To improve the transcriptome assembly performance, leveraging the overlapping predictions between different assemblies, we present a new consensus-based ensemble transcriptome assembly approach, ConSemble. Conclusions: Without using a reference genome, ConSemble using four de novo assemblers achieved an accuracy up to twice as high as any de novo assembler we compared. When a reference genome is available, ConSemble using four genome-guided assemblies removed many incorrectly assembled contigs with minimal impact on correctly assembled contigs, achieving higher precision and accuracy than individual genome-guided methods. Furthermore, ConSemble using de novo assemblers matched or exceeded the best performing genome-guided assemblers even when the transcriptomes included isoforms. We thus demonstrated that the ConSemble consensus strategy both for de novo and genome-guided assemblers can improve transcriptome assembly. The RNAseq simulation pipeline, the benchmark transcriptome datasets, and the script to perform the ConSemble assembly are all freely available from: http://bioinfolab.unl.edu/emlab/consemble/
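
The consensus idea described above, keeping predictions that several assemblies agree on, can be illustrated with a simple voting sketch in Python. This is not the ConSemble script: the abstract does not say how overlapping predictions are matched or how many assemblers must agree, so exact sequence identity and the min_support threshold used here are assumptions, and the function name is hypothetical.

```python
from collections import Counter


def consensus_contigs(assemblies, min_support=3):
    """Keep sequences reported by at least min_support of the given assemblies.

    assemblies: a list with one collection of contig sequences per assembler.
    Matching by exact sequence identity is an assumption of this sketch.
    """
    votes = Counter()
    for contigs in assemblies:
        # Count each assembler at most once per sequence.
        votes.update(set(contigs))
    return {seq for seq, n in votes.items() if n >= min_support}
```

Raising min_support trades recall for precision: a contig produced by only one assembler is dropped, which is how a voting scheme of this kind can remove incorrectly assembled contigs while keeping those that independent methods reproduce.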

DigitalCommons@University of Nebraska

Directory of Open Access Journals

Divergent evolution of extreme production of variant plant monounsaturated fatty acids

Author: Behera Sairam
Cahoon Edgar B.
Cai Yuanheng
Chai Jin
Gan Lu
Kaewsuwan Sireewan
Kim Hyojin
Liu Qun
Moriyama Etsuko
Mower Jeffrey P
Park Kiyoul
Shanklin John
Updike Evan M.
Voshall Adam
Wilson Mark A
Yu Xiao-Hong
Zhang Chi
Zhang Chunyu
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 12/06/2022

Metabolic extremes provide opportunities to understand enzymatic and metabolic plasticity and biotechnological tools for novel biomaterial production. We discovered that seed oils of many Thunbergia species contain up to 92% of the unusual monounsaturated petroselinic acid (18:1Δ6), one of the highest reported levels for a single fatty acid in plants. Supporting the biosynthetic origin of petroselinic acid, we identified a Δ6-stearoyl-acyl carrier protein (18:0-ACP) desaturase from Thunbergia laurifolia, closely related to a previously identified Δ6-palmitoyl-ACP desaturase that produces sapienic acid (16:1Δ6)-rich oils in Thunbergia alata seeds. Guided by a T. laurifolia desaturase crystal structure obtained in this study, enzyme mutagenesis identified key amino acids for functional divergence of Δ6 desaturases from the archetypal Δ9-18:0-ACP desaturase and mutations that result in nonnative enzyme regiospecificity. Furthermore, we demonstrate the utility of the T. laurifolia desaturase for the production of unusual monounsaturated fatty acids in engineered plant and bacterial hosts. Through stepwise metabolic engineering, we provide evidence that divergent evolution of extreme petroselinic acid and sapienic acid production arises from biosynthetic and metabolic functional specialization and enhanced expression of specific enzymes to accommodate metabolism of atypical substrates.

DigitalCommons@University of Nebraska

PubMed Central

The third international hackathon for applying insights into large-scale genomic composition to use cases in a wide range of organisms

Author: Agustinho Daniel Paiva
Aliyev Elbay
Avdeyev Pavel
Barrozo Enrico R.
Behera Sairam
Billingsley Kimberley
Busby Ben
Chen Guangyi
Chong Li Chuin
Choubey Deepak
Dabbaghie Fawaz
De Coster Wouter
Fu Yilei
Gener Alejandro R.
Hefferon Timothy
Henke David Morgan
Höps Wolfram
Illarionova Anastasia
Jochum Michael D.
Jose Maria
Kalra Divya
Kesharwani Rupesh K.
Khleifat Ahmad Al
Kolora Sree Rohit Raj
Kubica Jedrzej
Lakra Priya
Lattimer Damaris
Liew Chia-Sin
Lo Bai-Wei
Lo Chunhsuan
Lowdon Rebecca
Lötter Anneri
Mahmoud Medhat
Majidian Sina
Mendem Suresh Kumar
Molik David
Mondal Rajarshi
Ohmiya Hiroko
Parvin Nasrin
Paulin Luis F.
Peralta Carolina
Pfeifer Susanne P.
Poon Chi-Lam
Prabhakaran Ramanandan
Raza Muhammad Sohail
Saitou Marie
Sammi Aditi
Sanio Philippe
Sapoval Nicolae
Sedlazeck Fritz J
Soto Daniela C.
Syed Najeeb
Treangen Todd
Walker Kimberly
Wang Gaojianyong
Xu Tiancheng
Yang Jianzhi
Zhang Shangzhe
Zhou Weiyu
Publication venue: F1000 Research Ltd
Publication date: 01/01/2022

Brage NMBU

PubMed Central

UPSpace at the University of Pretoria

SUFFIX TREE, MINWISE HASHING AND STREAMING ALGORITHMS FOR BIG DATA ANALYSIS IN BIOINFORMATICS

Author: Behera Sairam
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 01/01/2020

DigitalCommons@University of Nebraska

Suffix Tree, Minwise Hashing and Streaming Algorithms for Big Data Analysis in Bioinformatics

Author: Behera Sairam
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 01/01/2020

In this dissertation, we worked on several algorithmic problems in bioinformatics using mainly three approaches: (a) a streaming model, (b) suffix-tree based indexing, and (c) minwise hashing (minhash) and locality-sensitive hashing (LSH). The streaming models are useful for large data problems where a good approximation needs to be achieved with limited space usage. We developed an approximation algorithm (KmerEstimate) using the streaming approach to obtain a better estimation of the frequency of k-mer counts. A k-mer, a subsequence of length k, plays an important role in many bioinformatics analyses such as genome distance estimation. We also developed new methods that use a suffix tree, a trie data structure, for alignment-free, non-pairwise algorithms for a conserved non-coding sequence (CNS) identification problem. We provided two different algorithms: STAG-CNS to identify exact-matched CNSs and DiCE to identify CNSs with mismatches. Using our algorithms, CNSs among various grass species were identified. A different approach was employed for identification of longer CNSs (≥ 100 bp, mostly found in animals). In our new method (MinCNE), the minhash approach was used to estimate the Jaccard similarity. Also using LSH, k-mers extracted from genomic sequences were clustered and CNSs were identified. Another new algorithm (MinIsoClust) that also uses minhash and LSH techniques was developed for an isoform clustering problem. Isoforms are generated from the same gene but by alternative splicing. As the isoform sequences share some exons but in different combinations, regular sequence clustering methods do not work well. Our algorithm generates clusters for isoform sequences based on their shared minhash signatures. Finally, we discuss de novo transcriptome assembly algorithms and how to improve the assembly accuracy using ensemble approaches. First, we did a comprehensive performance analysis on different transcriptome assemblers using simulated benchmark datasets. Then, we developed a new ensemble approach (Minsemble) for the de novo transcriptome assembly problem that integrates isoform clustering using the minhash technique to identify potentially correct transcripts from various de novo transcriptome assemblers. Minsemble identified more correctly assembled transcripts as well as genes compared to other de novo and ensemble methods.
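
The abstract describes clustering sequences by their shared minhash signatures with LSH. One common way to do this is the banding technique, sketched below in Python; whether the dissertation's methods use banding is not stated in the abstract, so this is an illustrative assumption, as are the function names and the bands and rows parameters.

```python
from collections import defaultdict


def lsh_buckets(signatures, bands, rows):
    """Group names whose signatures agree on every row of at least one band.

    signatures: dict mapping a name to a signature of length bands * rows.
    """
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError("signature length must equal bands * rows")
        for b in range(bands):
            # The band index is part of the key so different bands never collide.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(name)
    return buckets


def candidate_pairs(signatures, bands, rows):
    """Return the pairs of names that share at least one bucket."""
    pairs = set()
    for members in lsh_buckets(signatures, bands, rows).values():
        ordered = sorted(members)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

With more rows per band, two signatures must agree on a longer stretch to collide, so fewer dissimilar pairs become candidates; with more bands, similar pairs get more chances to collide. Only the candidate pairs then need a full similarity comparison.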

DigitalCommons@University of Nebraska

Suffix Tree, Minwise Hashing and Streaming Algorithms for Big Data Analysis in Bioinformatics

Author: Behera Sairam
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 01/01/2020

FixItFelix: improving genomic analysis by fixing reference errors

Author: Behera Sairam
Publication date: 12/07/2023

Ezid

A Comparison of a Campus Cluster and Open Science Grid Platforms for Protein-Guided Assembly using Pegasus Workflow Management System

Author: Begcy Kevin
Behera Sairam
Campbell Malachy
Deogun Jitender S.
Pavlovikj Natasha
Walia Harkamal
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 01/01/2014

Scientific workflows are a useful tool for managing large and complex computational tasks. Due to their intensive resource requirements, scientific workflows are often executed on distributed platforms, including campus clusters, grids and clouds. In this paper we build a scientific workflow for blast2cap3, the protein-guided assembly, using the Pegasus Workflow Management System (Pegasus WMS). The modularity of blast2cap3 allows us to decompose the existing serial approach into multiple tasks, some of which can be run in parallel. Afterwards, this workflow is deployed on two distributed execution platforms: Sandhills, the University of Nebraska Campus Cluster, and the Open Science Grid (OSG). We compare and evaluate the performance of the built workflow on both platforms. Furthermore, we also investigate the influence of the number of clusters of transcripts in the blast2cap3 workflow on the total running time. The performed experiments show that the Pegasus WMS implementation of blast2cap3 reduces the running time by more than 95% compared to the current serial implementation of blast2cap3. Although OSG provides more computational resources than Sandhills, our workflow experimental runs have better running time on Sandhills. Moreover, the selection of 300 clusters of transcripts gives the optimum performance with the resources allocated from Sandhills.
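
The decomposition described above, splitting a serial job into a chosen number of independent pieces that run in parallel and are merged afterwards, can be sketched with Python's standard library. This does not use Pegasus WMS or blast2cap3; the function names, the thread pool, and the task callable are stand-ins chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks nearly equal, order-preserving chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_in_parallel(items, task, n_chunks, max_workers=4):
    """Apply task to each non-empty chunk concurrently and merge results in order.

    task is a hypothetical callable that takes a list and returns a list.
    """
    chunks = [c for c in split_into_chunks(items, n_chunks) if c]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial = list(pool.map(task, chunks))
    return [result for part in partial for result in part]
```

The n_chunks parameter plays the role of the number of clusters in the abstract: too few chunks leave workers idle, while too many add scheduling overhead per task, which is why an intermediate value can give the best total running time.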

Crossref

DigitalCommons@University of Nebraska

FixItFelix: improving genomic analysis by fixing reference errors

Author: Behera Sairam
Dennis Megan Y
Farek Jesse
LeFaive Jonathon
Mahmoud Medhat
Orchard Peter
Parker Stephen CJ
Paulin Luis F
Sedlazeck Fritz J
Smith Albert V
Soto Daniela C
Zook Justin M
Publication venue: eScholarship, University of California
Publication date: 01/01/2023

The current version of the human reference genome, GRCh38, contains a number of errors including 1.2 Mbp of falsely duplicated and 8.04 Mbp of collapsed regions. These errors impact the variant calling of 33 protein-coding genes, including 12 with medical relevance. Here, we present FixItFelix, an efficient remapping approach, together with a modified version of the GRCh38 reference genome that improves the subsequent analysis across these genes within minutes for an existing alignment file while maintaining the same coordinates. We showcase these improvements over multi-ethnic control samples, demonstrating improvements for population variant calling as well as eQTL studies.

eScholarship - University of California